A Data Prism: Semi-Verified Learning in the Small-Alpha Regime

Authors

  • Michela Meister
  • Gregory Valiant
Abstract

We consider a simple model of unreliable or crowdsourced data in which there is an underlying set of n binary variables, each “evaluator” contributes a (possibly unreliable or adversarial) estimate of the values of some subset of r of the variables, and the learner is given the true value of a constant number of variables. We show that, provided an α-fraction of the evaluators are “good” (either correct, or with independent noise rate p < 1/2), the true values of a (1 − ε) fraction of the n underlying variables can be deduced as long as α > 1/(2 − 2p)^r. For example, if each “good” worker evaluates a random set of 10 items and there is no noise in their responses, then accurate recovery is possible provided the fraction of good evaluators is larger than 1/1024. This result is optimal in that if α ≤ 1/(2 − 2p)^r, the large dataset can contain no information. This setting can be viewed as an instance of the semi-verified learning model introduced in [3], which explores the tradeoff between the number of items evaluated by each worker and the fraction of “good” evaluators. Our results require the number of evaluators to be extremely large, > n^r, although our algorithm runs in linear time, O_{r,ε}(n), given query access to the large dataset of evaluations. This setting and these results can also be viewed as examining a general class of semi-adversarial CSPs with a planted assignment. This extreme parameter regime, where the fraction of reliable data is small (inverse exponential in the amount of data provided by each source), is relevant to a number of practical settings. For example, consider a setting where one has a large dataset of customer preferences, with each customer specifying preferences for a small (constant) number of items, and the goal is to ascertain the preferences of a specific demographic of interest. Our results show that this large dataset (which lacks demographic information) can be leveraged, together with the preferences of the demographic of interest for a constant number of randomly selected items, to recover an accurate estimate of the entire set of preferences, even if the fraction of the original dataset contributed by the demographic of interest is inverse exponential in the number of preferences supplied by each customer. In this sense, our results can be viewed as a “data prism” that allows one to extract the behavior of specific cohorts from a large, mixed dataset.
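To make the recovery condition concrete, the following minimal Python sketch evaluates the threshold on the fraction of good evaluators. It assumes the threshold has the form 1/(2 − 2p)^r, which is the form consistent with the abstract's 1/1024 example for r = 10 and p = 0. The function name `min_good_fraction` is hypothetical and is not taken from the paper; the sketch only computes the bound and does not implement the recovery algorithm.

```python
def min_good_fraction(r: int, p: float = 0.0) -> float:
    """Return the assumed threshold 1 / (2 - 2p)^r.

    r: number of variables each evaluator reports on.
    p: independent noise rate of a "good" evaluator, with 0 <= p < 1/2.
    Recovery is claimed only when the good fraction alpha strictly
    exceeds the returned value.
    """
    if r < 1:
        raise ValueError("r must be a positive integer")
    if not 0.0 <= p < 0.5:
        raise ValueError("p must satisfy 0 <= p < 1/2")
    return 1.0 / (2.0 - 2.0 * p) ** r


# The abstract's example: 10 items per worker, no noise.
print(min_good_fraction(10))  # 0.0009765625, i.e. 1/1024
```

Under this formula the threshold rises with the noise rate p and approaches 1 as p approaches 1/2, so noisier good evaluators must make up a larger share of the dataset.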

Similar resources

Qualitative Model of Strategic Partnership in Small and Medium Enterprises

The purpose of this research was to develop a qualitative model of strategic partnership for small and medium enterprises in the software industry. The research method was descriptive-analytic and was carried out using the Delphi technique. The Delphi expert panel consisted of 20 experts in the fields of business management, entrepreneurship management, strategic management, and the software industry that ...

THE EFFECT OF TRANSCRANIAL ALTERNATING CURRENT STIMULATION (TACS) ON ATTENTION IN STUDENTS WITH SPECIAL LEARNING DISORDER: SEMI-EXPERIMENTAL STUDY

Background & Aims: The main aim of this study is to investigate the effectiveness of Transcranial Alternating Current Stimulation (tACS) on attention in students with specific learning disorder. Materials & Methods: Twenty elementary school students with specific learning disorders were selected through purposive sampling and randomly divided into two groups, the experimental and co...

Composite Kernel Optimization in Semi-Supervised Metric

Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...

Semi-Supervised Learning Based Prediction of Musculoskeletal Disorder Risk

This study explores a semi-supervised classification approach, using random forest as a base classifier, to classify the risk of low-back disorders (LBDs) associated with industrial jobs. The semi-supervised classification approach uses unlabelled data together with a small amount of labelled data to create a better classifier. The results obtained by the proposed approach are compared with those o...

The Effectiveness of Training in Cognitive-Metacognitive Strategies upon the Cognitive Load and Working Memory of Elementary School Students with Specific Learning Difficulties in Reading

This research has aimed to study the effectiveness of training in cognitive-metacognitive strategies upon the cognitive load and working memory of senior primary school students with specific learning difficulties in reading by means of a semi-experimental method and using pre-tests and post-tests with a control group. The statistical population consisted of senior primary school students from ...

Journal title:
  • CoRR

Volume abs/1708.02740

Publication date 2017